The World’s First Commercially-Available Stream Processor: Architecture, Algorithms and Benchmark Results

Simon McIntosh-Smith
ClearSpeed Technology, Ltd.
Email Address: simon@clearspeed.com

Ron Bell
AWE - Aldermaston
Email Address: ron.bell@awe.co.uk


Abstract: 

This briefing describes new applications for ClearSpeed’s CS301 device, the first commercially
available stream processor. Launched in October 2003, the CS301 is an ultra-high performance
next-generation Single-Instruction/Multiple-Data (SIMD) stream processor, delivering 25
GFLOPS and 12.8 GMACS at 1.8 Watts. The CS301’s low-power, Multi-Threaded Array
Processor (MTAP) architecture scales to hundreds and ultimately thousands of processing
elements, each with both floating point and integer hardware, capable of data parallel processing
on image and signal processing applications as well as for compression, encryption, search, and
general sensor processing applications. The processor is supported by a flexible development
environment, including assembly language and C-based language support, as well as a
cycle-accurate simulator, with plans to develop industry standard API libraries such as L3 BLAS and
FFTW. This new class of stream processor has been shown to provide ten to one hundred times
the overall performance of PowerPC or Pentium-based architectures, especially when performing
image and signal processing functions, such as FFTs or filters. In general, the architecture has
been shown to provide significant throughput, size, and power advantages for embedded
processing applications.


AWE Aldermaston has been investigating potential uses for CS301-class processors in its key
algorithms and applications. AWE further optimised fast math library routines on the CS301 for
SGEMM, a single precision floating point matrix multiply, verifying the CS301’s
record-breaking math performance. AWE took matrix multiply from 5 GFLOPS sustained to over 12
GFLOPS sustained on a single CS301. AWE is performing ongoing work exploring the
acceleration potential of the CS301 for several in-house and third-party scientific codes, such as
DL-POLY.

CS301-based accelerator boards have been shipping since January 2003 and multiple
algorithms have been ported, with more underway. A dual-processor PCI-based development
card is available from ClearSpeed, providing a total of 50 GFLOPS of compute performance
at a maximum power consumption of 10 Watts for the board. Single systems containing up to
5 boards have been demonstrated, for a total of 500 GFLOPS of compute and a capability of 1 million
FFTs per second (1K complex single precision floating-point). This level of compute density has
never before been commercially available, with the CS301 delivering more than 10 GFLOPS per
Watt.

In the first half of this briefing ClearSpeed will present performance results from the numerous
algorithms and applications that have been or are being ported to the CS301 stream processor. These
include numerous sizes of FFTs and FIR filters, which efficiently utilize the architecture and
the floating point hardware in each PE to gain exceptional performance at very low power dissipation
levels. We will include an update on improvements to work previously announced jointly with
Lockheed Martin at HPEC03 on pulse compression for radar (FFT - Complex Multiply -
IFFT). The results to be reported are significantly higher than those of other industry standard processing
and DSP platforms. New work on other transforms, such as DCTs, will also be presented. In
addition, results from work to develop a Level 3 BLAS (Basic Linear Algebra Subprograms)
library will be reported, including the performance of certain vector and matrix operations, such as
matrix multiplication and matrix inversion, and descriptions of the algorithms required on
this high-performance, highly parallel architecture.
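
To make the FIR filter workload mentioned above concrete, here is a minimal direct-form sketch in portable C. It is an illustration only: the function name fir_filter and its argument layout are assumptions, and it is not ClearSpeed's library code. The point it shows is that each output sample depends only on the inputs, so different outputs could be assigned to different processing elements.

```c
#include <stddef.h>

/* Direct-form FIR filter (illustrative sketch, not ClearSpeed code).
 * y[i] = sum over k of h[k] * x[i - k], treating x[m] as 0 for m < 0.
 * Every y[i] is computed from the inputs alone, so the outer loop is
 * data parallel: each iteration could run on a separate PE. */
void fir_filter(const float *x, size_t n, const float *h, size_t taps, float *y)
{
    for (size_t i = 0; i < n; i++) {
        float acc = 0.0f;
        /* Stop at k == i so we never read before the start of x. */
        for (size_t k = 0; k < taps && k <= i; k++)
            acc += h[k] * x[i - k];
        y[i] = acc;
    }
}
```

For example, filtering the input 1, 2, 3, 4 with the two taps 1, 1 produces the running pair sums 1, 3, 5, 7.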

In the second half of this briefing AWE Aldermaston will present its work to benchmark the
CS301. The briefing will include descriptions of optimizations to a fast matrix multiply
algorithm for the MTAP streaming architecture, improving performance from 5 GFLOPS to over
12 GFLOPS on the CS301. AWE will also describe its investigations into using the CS301 to
accelerate certain applications used in-house, such as the materials science code DL-POLY.
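
The abstract does not say how the matrix multiply was restructured. As a generic illustration of the kind of change that typically raises sustained SGEMM throughput, the sketch below uses loop blocking (tiling) so that a small tile of each matrix is reused many times while it sits in fast local memory. The function name sgemm_blocked and the tile size BLOCK are assumptions made for this example, not CS301 parameters or AWE's actual code.

```c
#include <stddef.h>

#define BLOCK 4  /* illustrative tile size; an assumption, not a CS301 value */

/* C = C + A * B for row-major n x n single precision matrices.
 * The three outer loops walk over tiles; the three inner loops multiply
 * one tile of A by one tile of B and accumulate into a tile of C. */
void sgemm_blocked(size_t n, const float *a, const float *b, float *c)
{
    for (size_t ii = 0; ii < n; ii += BLOCK) {
        size_t i_end = (ii + BLOCK < n) ? ii + BLOCK : n;
        for (size_t kk = 0; kk < n; kk += BLOCK) {
            size_t k_end = (kk + BLOCK < n) ? kk + BLOCK : n;
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                size_t j_end = (jj + BLOCK < n) ? jj + BLOCK : n;
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        float aik = a[i * n + k];  /* reused across the j loop */
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

The caller is expected to zero C first if a plain product is wanted. The edge handling (i_end, k_end, j_end) lets n be any size, not only a multiple of the tile size.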

Multi-threaded Array Processing
- Programmed in high-level languages
- Hardware multi-threading
  • Enables simultaneous data streaming and computation for latency tolerance
- Run-time extensible instruction set

Array of Processor Elements
- PEs are VLIW cores
- Flexible data parallel processing
- Built-in PE fault tolerance, resiliency

High performance, low power
- 10 GFLOPS/Watt

Multiple high bandwidth I/O channels

Each PE is a VLIW processor:
• Multiple execution units
  • Floating point adder (32-bit IEEE 754)
  • Floating point multiplier (32-bit IEEE 754)
  • Divide/square root unit
  • Fixed point MAC 8x8 -> 16+48
  • Integer ALU with shifter
  • Load/store
• High-bandwidth, 5-port register file (3r, 2w)
• Closely coupled 4KB SRAM for data
• High bandwidth per PE load/store (PIO)
• Per PE address generator
• Complete pointer model, including parallel pointer chasing and vectors of addresses

• 2 chip board - 50 GFLOPS peak @ 10W total
• 200K FFTs/s (1K complex single precision IEEE 754)
• Up to 1GB DRAM for local processing
• Shipping since 1Q04
• Single slot width full-size PCI card


© ClearSpeed 2004 | www.clearspeed.com


What Applications Can Be Accelerated?

Any applications with significant data parallelism:
• Fine-grained - vector operations
• Medium-grained - unrolled independent loops
• Coarse-grained - multiple simultaneous data channels/sets

Example applications and libraries include:
• Math libraries - BLAS, LAPACK (→ Matlab, Maple, ...)
• Chemistry - GROMACS, CHARMM, BLAST, DLPOLY, ...
• Computational finance - Monte Carlo, genetic algorithms
• Intelligent systems - artificial neural networks
• Signal processing - FFT (1D, 2D, 3D), FIR
• Simulation - CFD, N-body, Finite Element
• Image processing - filtering, image recognition, DCTs

Software Development Kit (SDK)

C compiler, assembler, libraries, visual debugger, etc.
CS301-based development boards
Available for Linux and Windows

Applications and libraries under development
• Math - L3 BLAS, LAPACK
• DSP - FFTs (1D, 2D, 3D)
• Bio/Chemistry - GROMACS, DLPOLY, DockIt
• Financial - random number generation, Monte Carlo
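
/* Serial C version: one element per loop iteration. */
void daxpy(double *c, double *a, double alpha, uint N) {
    uint i;
    for (i = 0; i < N; i++)
        c[i] = c[i] + a[i] * alpha;
}

/* Data-parallel version: "poly" variables have one instance per PE,
   so each iteration processes num_pes elements at once. */
void daxpy(double *c, double *a, double alpha, uint N) {
    uint i;
    poly double cp, ap;
    for (i = 0; i < N; i += num_pes) {
        /* Copy this PE's element from shared (mono) memory into PE memory. */
        memcpym2p(&cp, &c[i + pe_num], sizeof(double));
        memcpym2p(&ap, &a[i + pe_num], sizeof(double));
        cp = cp + ap * alpha;
        /* Copy the result back from PE memory to shared memory. */
        memcpyp2m(&c[i + pe_num], &cp, sizeof(double));
    }
}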
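
The data-parallel listing above hands element i + pe_num to each PE on every pass of the loop. The following portable C sketch emulates that index mapping on an ordinary CPU so the decomposition can be checked without CS301 hardware. The name daxpy_emulated and the value of NUM_PES are assumptions for illustration, and the inner loop only stands in for PEs that would execute in lockstep. Unlike the slide code, the sketch also guards the case where N is not a multiple of the PE count.

```c
#include <stddef.h>

#define NUM_PES 4  /* hypothetical PE count, chosen small for illustration */

/* Emulates the PE-strided daxpy: each outer step covers NUM_PES consecutive
 * elements, and the inner loop plays the role of the individual PEs. */
void daxpy_emulated(double *c, const double *a, double alpha, size_t n)
{
    for (size_t i = 0; i < n; i += NUM_PES) {
        for (size_t pe = 0; pe < NUM_PES; pe++) {
            size_t idx = i + pe;
            if (idx >= n)
                continue;            /* this PE has no element on the last pass */
            double cp = c[idx];      /* stands in for memcpym2p into cp */
            double ap = a[idx];      /* stands in for memcpym2p into ap */
            cp = cp + ap * alpha;
            c[idx] = cp;             /* stands in for memcpyp2m back to c */
        }
    }
}
```

With a vector of five elements and four emulated PEs, the first pass updates elements 0 to 3 and the second pass updates only element 4.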

• Chemistry codes: DLPOLY (Molecular Dynamics)
  - Owned by UK Daresbury Lab, heavily used at AWE
  - Widely used in academia and industry
  - 91% of CPU time in 5 relatively small routines
  - One of these (forces) calls the other 4 to compute forces on all atoms
  - "forces" called once per time step
  - Data needing to be returned by "forces" from CS to host is relatively small
  - Calculation for each atom is independent

• Matrix Multiply Benchmark (SGEMM)
  - CS301 single precision code started at ~20% efficiency
  - AWE helped CS restructure code to give 12 GFLOPS (47% efficiency)
  - Performance verified by AWE on CS301 hardware
  - Next-generation processor from ClearSpeed ("Avebury") significantly increases this performance
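
The slide notes that the force calculation for each atom is independent, which is what makes the "forces" routine a candidate for a PE array. The sketch below illustrates that property in portable C. It is not DLPOLY code: the Lennard-Jones potential in reduced units (epsilon = sigma = 1), the vec3 type, and the function names are all assumptions chosen to keep the example short.

```c
#include <stddef.h>

typedef struct { double x, y, z; } vec3;

/* Force on atom i from every other atom, using a Lennard-Jones potential
 * in reduced units (an assumption for illustration). The function only
 * reads the shared position array, so calls for different i do not
 * interfere with one another. */
vec3 force_on_atom(const vec3 *pos, size_t n, size_t i)
{
    vec3 f = {0.0, 0.0, 0.0};
    for (size_t j = 0; j < n; j++) {
        if (j == i)
            continue;
        double dx = pos[i].x - pos[j].x;
        double dy = pos[i].y - pos[j].y;
        double dz = pos[i].z - pos[j].z;
        double r2 = dx * dx + dy * dy + dz * dz;
        double inv_r2 = 1.0 / r2;
        double inv_r6 = inv_r2 * inv_r2 * inv_r2;
        /* Magnitude over distance: 24 * r^-8 * (2 * r^-6 - 1). */
        double s = 24.0 * inv_r2 * inv_r6 * (2.0 * inv_r6 - 1.0);
        f.x += s * dx;
        f.y += s * dy;
        f.z += s * dz;
    }
    return f;
}

/* One time step's worth of forces. Each iteration is independent, so the
 * loop over i is the part that could be spread across PEs. */
void compute_forces(const vec3 *pos, vec3 *force, size_t n)
{
    for (size_t i = 0; i < n; i++)
        force[i] = force_on_atom(pos, n, i);
}
```

Only the force array has to travel back to the host after each step, which matches the slide's observation that the data returned by "forces" is relatively small.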

CS301 Based Development Board
• 50 GFLOPS peak @ 10W maximum
• 200K FFTs/s (1K complex single precision IEEE 754)
• Up to 1GB DRAM for local processing
• Single slot width full-size PCI card
• In evaluation use since early 2004

Chemistry codes: DLPOLY (Molecular Dynamics)
• Owned by UK Daresbury Laboratory
• Widely used within AWE, also academia & industry
• 91% of CPU time in 5 small routines
• One calls the other 4 to compute forces on all atoms
• Forces called once per time step
• Small amount of data returned by forces from CS to host
• Calculation for each atom is independent

Matrix Multiply Benchmark (SGEMM)
• CS301 single precision code started at ~20% efficiency
• AWE/CS code restructuring gave 12 GFLOPS (47% efficiency)
• Performance verified by AWE on CS301 hardware
• "Avebury" significantly increases this performance

"Avebury"
• 50 GFLOPS 32/64-bit
• 64-bit addressing
• launched Oct 2004

CS301
• 25 GFLOPS 32-bit
• 32-bit addressing
• launched Oct 2003